-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: add AWS HealthOmics data store management tools #1498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add AWS HealthOmics data store management tools #1498
Conversation
- Add sequence store tools for read set operations and imports - Add variant store tools for variant search and import operations - Add reference store tools for reference genome management - Add annotation store tools for annotation search and imports - Add data import tools for S3 integration and file discovery - Update server.py to register all new tools with AHO naming convention - Enhance server instructions with complete data management capabilities Resolves awslabs#1421: Enhances AWS HealthOmics MCP server with data store management
- Add test coverage for sequence store tools - Add test coverage for data import and S3 integration tools - Update server tests to include all new data store tools - Include both success and error handling test cases - Verify proper AWS API integration patterns Tests cover: - Sequence store operations (list, get, import) - S3 file discovery and validation - Data import source preparation - Error handling for AWS API failures
Applied pre-commit hooks to ensure compliance with AWS contribution standards: - Fixed trailing whitespace and end-of-file formatting - Applied ruff code formatting - Added allowlist comment for ETag value to address secret detection All code now passes pre-commit checks and is ready for review. Fixes awslabs#1421 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds comprehensive data store management tools to the AWS HealthOmics MCP server, expanding from workflow-only operations to include full genomic data lifecycle management. The enhancement adds 33 new tools across sequence stores, variant stores, reference stores, annotation stores, and S3 integration capabilities.
- Enables complete genomic analysis workflows from data discovery through results analysis
- Adds auto-discovery of genomic files in S3 with validation and import preparation
- Provides comprehensive data store operations for managing genomic datasets
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/test_server.py | Updated test to validate all 33 new tools are properly registered |
| tests/test_sequence_store_tools.py | Comprehensive test suite for sequence store operations including import jobs |
| tests/test_data_import_tools.py | Test coverage for S3 integration and genomic file discovery functionality |
| awslabs/aws_healthomics_mcp_server/tools/variant_store_tools.py | Variant store management including search, count, and import operations |
| awslabs/aws_healthomics_mcp_server/tools/sequence_store_tools.py | Sequence store operations for managing read sets and import jobs |
| awslabs/aws_healthomics_mcp_server/tools/reference_store_tools.py | Reference genome management and import functionality |
| awslabs/aws_healthomics_mcp_server/tools/data_import_tools.py | S3 integration utilities for file discovery and import preparation |
| awslabs/aws_healthomics_mcp_server/tools/annotation_store_tools.py | Annotation store management for genomic annotations |
| awslabs/aws_healthomics_mcp_server/server.py | Tool registration and enhanced documentation for all new capabilities |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| mock_response = { | ||
| 'ContentLength': 1024000, | ||
| 'LastModified': datetime(2023, 10, 1, 12, 0, 0), | ||
| 'ETag': '"abc123def456"', # pragma: allowlist secret |
Copilot
AI
Oct 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment should use 'allow list' (two words) instead of 'allowlist' (one word).
| 'ETag': '"abc123def456"', # pragma: allowlist secret | |
| 'ETag': '"abc123def456"', # pragma: allow list secret |
|
Hi @peterbb148, thanks for the contribution, I think this adds some valuable tools to the MCP to close some of the gap between the HealthOmics API and the MCP. The S3 search tool that you have added might be redundant with another PR that we have in process which adds a multi-bucket search, includes healthomics stores and associates and groups files. #1501. Can you take a look at this and see if the overlap would make the proposed tool redundant? |
|
I also noticed this lint failure, can you address that? |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1498 +/- ##
==========================================
- Coverage 89.72% 89.02% -0.70%
==========================================
Files 735 665 -70
Lines 52491 49426 -3065
Branches 8445 8090 -355
==========================================
- Hits 47098 44003 -3095
- Misses 3453 3492 +39
+ Partials 1940 1931 -9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I kicked off the build/ test pipeline. Looks like some errors are occurring in unit tests. https://github.com/awslabs/mcp/actions/runs/18415491876/job/52647610592?pr=1498 |
Response to PR #1501 Overlap ConcernThanks for flagging the potential overlap with PR #1501. I've reviewed both PRs and here's my analysis: Overlap AssessmentPR #1501 provides:
This PR (1498) provides:
Tools with Overlap:
Unique Tools (No Overlap):
RecommendationI propose removing the overlapping tools ( The core value of this PR remains:
Users would use PR #1501's Would this approach work? I'm happy to remove the overlapping tools if that makes sense. |
Resolves test failures where Pydantic FieldInfo objects were being passed instead of actual values when tests call functions directly. Changes: - Add _get_value() helper to extract actual values from FieldInfo objects - Update list_aho_sequence_stores() to handle FieldInfo in next_token - Update list_aho_read_sets() to handle FieldInfo in all optional params - Change truthiness checks to explicit None checks Fixes: - test_list_aho_sequence_stores_success: Expected call not found - test_list_aho_read_sets_success: 'FieldInfo' object has no attribute 'replace'
All PR comments addressed! ✅1. PR #1501 Overlap AnalysisAdded comment with detailed analysis. The overlapping S3 discovery tools (
2. Contributor Statement AddedAdded the required acknowledgment to the PR description as requested. 3. Test Failures FixedFixed both failing tests: Issue: When tests call functions directly (bypassing FastMCP), Pydantic Solution:
Test Results: All 7 tests in the test suite now pass! 🎉 The PR should now pass CI checks. Ready for review! |
| except botocore.exceptions.ClientError as e: | ||
| error_code = e.response['Error']['Code'] | ||
| error_message = e.response['Error']['Message'] | ||
|
|
||
| logger.error(f'Failed to list annotation stores: {error_code} - {error_message}') | ||
|
|
||
| raise Exception(f'Failed to list annotation stores: {error_code} - {error_message}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: The same error-handling logic appears in each function (30+ instances), with only the messages varying. It might be worth abstracting this into a shared helper to reduce duplication and simplify maintenance.
| Exception: If there's an error retrieving annotation store information | ||
| """ | ||
| try: | ||
| client = get_omics_client() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_omics_client() currently creates a new connection every time. When called in every function in the aho tools/, a new connection is being created every time which creates overhead. Have we considered reusing the connections by perhaps client caching or connection pooling?
| from urllib.parse import urlparse | ||
|
|
||
|
|
||
| def parse_s3_uri(s3_uri: str) -> Dict[str, str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @markjschreiber mentioned earlier, there are some similarities with #1501, which was merged shortly after this was published.
Several of these util functions now exist in main with equivalent functionality. I’d recommend doing another quick pass to see if any can be consolidated. For instance, parse_s3_uri could be replaced or simplified by adjusting the output of the existing s3_utils.py implementation.
| try: | ||
| client = get_omics_client() | ||
|
|
||
| response = client.get_annotation_store(name=annotation_store_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Consider adding retry logic for all API calls to handle transient AWS errors or rate-limiting scenarios.
| if chromosome: | ||
| # Normalize chromosome format | ||
| if not chromosome.startswith('chr'): | ||
| chromosome = f'chr{chromosome}' | ||
| filter_criteria['contigName'] = {'eq': chromosome} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Might want to add a validator to make sure the chromosome values provided are actually valid.
| ge=1, | ||
| le=100, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Do we want to make these values configurable?
|
|
||
|
|
||
| async def start_aho_annotation_import_job( | ||
| ctx: Context, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my own learning, how does the context get applied if it's not referenced in the function body?
Summary
This PR enhances the AWS HealthOmics MCP server with comprehensive data store management capabilities, adding 33 new tools that complement the existing workflow management functionality. The enhancement provides a complete genomic analysis platform covering both workflow execution and data operations.
New Features
Data Store Operations
S3 Integration & Data Discovery
Technical Implementation
tools/directoryFiles Added/Modified
New Tool Modules
awslabs/aws_healthomics_mcp_server/tools/sequence_store_tools.pyawslabs/aws_healthomics_mcp_server/tools/variant_store_tools.pyawslabs/aws_healthomics_mcp_server/tools/reference_store_tools.pyawslabs/aws_healthomics_mcp_server/tools/annotation_store_tools.pyawslabs/aws_healthomics_mcp_server/tools/data_import_tools.pyUpdated Core Files
awslabs/aws_healthomics_mcp_server/server.py- Tool registration and enhanced documentationtests/test_server.py- Updated tool validationTest Coverage
tests/test_sequence_store_tools.py- Comprehensive sequence store testingtests/test_data_import_tools.py- S3 integration and file discovery testsUsage Workflow
This enhancement enables complete genomic analysis workflows:
Compliance
Test Plan
Fixes #1421
Acknowledgment
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of the project license.